Read the training data and the pre-calculated tangent distance matrix:
load("zip.RData")
load("zip.dist.RData")
Perform MDS with tangent distance to reduce the original data to a 2D space:
library(MASS)
dat.mds <- isoMDS(zip.dist, k=2)
## initial value 34.218873
## final value 34.218662
## converged
library(ggplot2)
ggplot(as.data.frame(dat.mds$points), aes(V1, V2, color=dat$train[,1])) +
geom_point() + coord_fixed() + labs(color="digit")
The digits are well separated, better than with PCA (which is equivalent to classical MDS with Euclidean distance).
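To make the parenthetical claim concrete, here is a small self-contained sketch (synthetic data, not the zip digits) showing that classical MDS on Euclidean distances recovers the PCA scores, up to a sign flip of each axis:

```r
set.seed(1)
x <- matrix(rnorm(50 * 4), nrow = 50)   # 50 points in 4 dimensions

mds.coords <- cmdscale(dist(x), k = 2)  # classical (metric) MDS
pca.coords <- prcomp(x)$x[, 1:2]        # first two principal component scores

# the two embeddings agree up to a sign flip of each axis
agree <- sapply(1:2, function(j)
  min(max(abs(mds.coords[, j] - pca.coords[, j])),
      max(abs(mds.coords[, j] + pca.coords[, j]))))
all(agree < 1e-8)   # TRUE
```

isoMDS differs from this only in that it iteratively refines the configuration to minimize a non-metric stress criterion, starting from the classical solution.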
Subsample 1000 digits for Isomap:
set.seed(10)
ind <- sample(1:nrow(dat$train), 1000)
x <- as.matrix(dat$train[ind, -1])
y <- dat$train[ind, 1]
Perform Isomap with k=5 to find a 2D embedding:
library(RDRToolbox)
dat.isomap <- Isomap(data=x, dims=2, k=5)
## Computing distance matrix ... done
## Building graph with shortest paths (using 5 nearest neighbours) ... done
## Computing low dimensional embedding ... done
## number of samples: 1000
## reduction from 256 to 2 dimensions
## number of connected components in graph: 1
ggplot(as.data.frame(dat.isomap$dim2), aes(V1, V2, color=y)) +
geom_point() + coord_fixed() + labs(color="digit")
Try different parameter values (k) for Isomap:
k.vals <- c(2,3,5,10,25,50)
dat.isomap <- list()
for (i in 1:length(k.vals)) {
dat.isomap[[i]] <- Isomap(data=x, dims=2, k=k.vals[i])$dim2
}
## Computing distance matrix ... done
## Building graph with shortest paths (using 2 nearest neighbours) ... done
## Computing low dimensional embedding ... done
## number of samples: 1000
## reduction from 256 to 2 dimensions
## number of connected components in graph: 2
## Computing distance matrix ... done
## Building graph with shortest paths (using 3 nearest neighbours) ... done
## Computing low dimensional embedding ... done
## number of samples: 1000
## reduction from 256 to 2 dimensions
## number of connected components in graph: 1
## Computing distance matrix ... done
## Building graph with shortest paths (using 5 nearest neighbours) ... done
## Computing low dimensional embedding ... done
## number of samples: 1000
## reduction from 256 to 2 dimensions
## number of connected components in graph: 1
## Computing distance matrix ... done
## Building graph with shortest paths (using 10 nearest neighbours) ... done
## Computing low dimensional embedding ... done
## number of samples: 1000
## reduction from 256 to 2 dimensions
## number of connected components in graph: 1
## Computing distance matrix ... done
## Building graph with shortest paths (using 25 nearest neighbours) ... done
## Computing low dimensional embedding ... done
## number of samples: 1000
## reduction from 256 to 2 dimensions
## number of connected components in graph: 1
## Computing distance matrix ... done
## Building graph with shortest paths (using 50 nearest neighbours) ... done
## Computing low dimensional embedding ... done
## number of samples: 1000
## reduction from 256 to 2 dimensions
## number of connected components in graph: 1
library(gridExtra)
p <- list()
for(i in 1:length(k.vals)) {
p[[i]] <- ggplot(as.data.frame(dat.isomap[[i]]), aes(V1, V2, color=y)) +
geom_point(size=.1) + coord_fixed() + guides(color=guide_legend(title="digit")) +
ggtitle(paste('k =', k.vals[i])) +
theme(legend.position="none", plot.title=element_text(size=10), axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8), axis.title=element_blank())
}
do.call(grid.arrange, c(p, ncol=3))
The embeddings are sensitive to k: with k=2 the neighbourhood graph splits into two connected components, while large k adds shortcut edges that wash out the manifold structure.
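The k=2 run above is the one where the graph splits into two components. A minimal base-R sketch (on a synthetic two-cluster dataset, not the digits) of how the number of connected components of the symmetrized k-nearest-neighbour graph depends on k:

```r
knn.components <- function(x, k) {
  n <- nrow(x)
  d <- as.matrix(dist(x))
  adj <- matrix(FALSE, n, n)
  for (i in 1:n) {
    nb <- order(d[i, ])[2:(k + 1)]  # k nearest neighbours (position 1 is the point itself)
    adj[i, nb] <- TRUE
    adj[nb, i] <- TRUE              # symmetrize the graph
  }
  comp <- integer(n); ncomp <- 0
  for (s in 1:n) {
    if (comp[s] == 0) {             # breadth-first search from each unvisited point
      ncomp <- ncomp + 1
      queue <- s
      comp[s] <- ncomp
      while (length(queue) > 0) {
        v <- queue[1]; queue <- queue[-1]
        nb <- which(adj[v, ] & comp == 0)
        comp[nb] <- ncomp
        queue <- c(queue, nb)
      }
    }
  }
  ncomp
}

g <- as.matrix(expand.grid(1:5, 1:6))   # 30 points on a unit grid
x <- rbind(g, g + 20)                   # two identical clusters, far apart
knn.components(x, k = 4)    # 2: no k-NN edge can bridge the clusters
knn.components(x, k = 40)   # 1: with k this large, neighbours span both clusters
```

When the graph is disconnected, Isomap cannot compute geodesic distances between the components, so checking connectivity like this is a cheap sanity check before embedding.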
Apply tSNE with perplexity = 25 to find a 2D embedding of digits data:
library(Rtsne)
set.seed(10)
dat.tsne <- Rtsne(dat$train[,-1], dims=2, perplexity=25, verbose=T)
## Read the 2219 x 50 data matrix successfully!
## Using no_dims = 2, perplexity = 25.000000, and theta = 0.500000
## Computing input similarities...
## Normalizing input...
## Building tree...
## - point 0 of 2219
## Done in 0.29 seconds (sparsity = 0.046597)!
## Learning embedding...
## Iteration 50: error is 79.918394 (50 iterations in 0.89 seconds)
## Iteration 100: error is 72.216926 (50 iterations in 0.82 seconds)
## Iteration 150: error is 71.396813 (50 iterations in 0.78 seconds)
## Iteration 200: error is 71.113253 (50 iterations in 0.79 seconds)
## Iteration 250: error is 70.967667 (50 iterations in 0.78 seconds)
## Iteration 300: error is 1.925602 (50 iterations in 0.71 seconds)
## Iteration 350: error is 1.638494 (50 iterations in 0.68 seconds)
## Iteration 400: error is 1.508843 (50 iterations in 0.70 seconds)
## Iteration 450: error is 1.436164 (50 iterations in 0.72 seconds)
## Iteration 500: error is 1.400877 (50 iterations in 0.72 seconds)
## Iteration 550: error is 1.380000 (50 iterations in 0.74 seconds)
## Iteration 600: error is 1.365542 (50 iterations in 0.75 seconds)
## Iteration 650: error is 1.354502 (50 iterations in 0.75 seconds)
## Iteration 700: error is 1.345434 (50 iterations in 0.75 seconds)
## Iteration 750: error is 1.338209 (50 iterations in 0.74 seconds)
## Iteration 800: error is 1.332063 (50 iterations in 0.75 seconds)
## Iteration 850: error is 1.326741 (50 iterations in 0.77 seconds)
## Iteration 900: error is 1.322192 (50 iterations in 0.76 seconds)
## Iteration 950: error is 1.318307 (50 iterations in 0.77 seconds)
## Iteration 1000: error is 1.314517 (50 iterations in 0.76 seconds)
## Fitting performed in 15.14 seconds.
ggplot(as.data.frame(dat.tsne$Y), aes(V1, V2, color=dat$train[,1])) +
geom_point() + coord_fixed() + labs(color="digit")
Try different perplexity values:
perp.vals <- c(2,5,10,25,50,100)
calc.tsne <- function(seed) {
dat.tsne <- list()
set.seed(seed)
for (i in 1:length(perp.vals)) {
dat.tsne[[i]] <- Rtsne(dat$train[,-1], dims=2, perplexity=perp.vals[i])$Y
}
dat.tsne
}
dat.tsne <- calc.tsne(seed=10)
plot.tsne <- function() {
p <- list()
for(i in 1:length(perp.vals)) {
p[[i]] <- ggplot(as.data.frame(dat.tsne[[i]]), aes(V1, V2, color=dat$train[,1])) +
geom_point(size=.1) + coord_fixed() + guides(color=guide_legend(title="digit")) +
ggtitle(paste('perplexity =', perp.vals[i])) +
theme(legend.position="none", plot.title=element_text(size=10), axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8), axis.title=element_blank())
}
do.call(grid.arrange, c(p, ncol=3))
}
plot.tsne()
The results are sensitive to perplexity.
With different seeds the results differ, since the t-SNE objective is non-convex and the optimizer is not guaranteed to find the global optimum.
dat.tsne <- calc.tsne(seed=100)
plot.tsne()
dat.tsne <- calc.tsne(seed=1000)
plot.tsne()
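Note that t-SNE embeddings are only defined up to rotation and reflection, so when comparing runs across seeds it helps to align them first. A self-contained sketch (on a synthetic "embedding", not the t-SNE output above) of an orthogonal Procrustes alignment:

```r
procrustes.align <- function(a, b) {
  # find the orthogonal Q minimising ||aQ - b||_F (after centring both)
  a <- scale(a, scale = FALSE)
  b <- scale(b, scale = FALSE)
  s <- svd(t(a) %*% b)
  a %*% (s$u %*% t(s$v))
}

set.seed(1)
a <- matrix(rnorm(40), ncol = 2)   # a toy 2D "embedding"
theta <- pi / 3
rot <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2)
b <- a %*% rot                     # the same embedding, rotated

max(abs(procrustes.align(a, b) - scale(b, scale = FALSE)))  # ~0: perfectly aligned
```

After alignment, any remaining differences between two runs reflect genuinely different local optima rather than an arbitrary choice of axes.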
Apply KNN with one nearest neighbor to the test data:
library(class)
y.test.knn <- knn(dat$train[,-1], dat$test[,-1], dat$train[,1], k=1, prob=F)
Confusion matrix for the test data:
table(y.test.knn, dat$test[,1])
##
## y.test.knn 1 3 5
## 1 261 0 1
## 3 2 157 5
## 5 1 9 154
Misclassification rate for the test data:
mean(y.test.knn != dat$test[,1])
## [1] 0.03050847
Apply KNN with different k values (only odd values, since ties are much more common when k is even):
set.seed(10)
k.vals <- seq(1, 21, by=2)
misrates <- sapply(k.vals, function(k) mean(knn(dat$train[,-1],dat$test[,-1],dat$train[,1],k=k,prob=F) != dat$test[,1]))
Plot the error rate against k:
qplot(k.vals, misrates, geom="line") + labs(x="k", y="error rate")
k = 1, 5, and 7 all yield the same (best) error rate.
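For reference, what knn(..., k=1) computes can be sketched from scratch in a few lines (on synthetic data, not the digits): each test point simply receives the label of its single nearest training point.

```r
nn1 <- function(train, test, labels) {
  # for each test row, return the label of the closest training row
  labels[apply(test, 1, function(z)
    which.min(colSums((t(train) - z)^2)))]   # squared Euclidean distances
}

set.seed(1)
train <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
               matrix(rnorm(20, mean = 5), ncol = 2))
labels <- rep(c("a", "b"), each = 10)
test <- rbind(c(0, 0), c(5, 5))
nn1(train, test, labels)   # "a" "b": each test point matches its own cluster
```

The class::knn implementation does the same thing (with efficient distance computation and random tie-breaking for k > 1).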
Function for plotting the digit data as an image:
conv.image <- function(vec)
{
# reshape the 256 pixel values into a 16 x 16 matrix, then negate and
# reverse the columns so the digit renders upright in image()
mat <- matrix(as.numeric(vec), nrow=16, ncol=16)
mat <- -mat[, 16 : 1]
par(mar=c(0, 0, 0, 0))
image(mat, col=gray(seq(0, 1, 0.01)), xaxt='n', yaxt='n')
}
Here are digits that are actually 3’s but are misclassified as 5’s (k=1). Only the first one is plotted here; the rest are saved to PDF files.
mis.three <- intersect(which(y.test.knn==5), which(dat$test[,1]==3))
conv.image(dat$test[mis.three[1], -1])
for (i in 1:length(mis.three)) {
pdf(paste0('mis.three.knn.', i, '.pdf'))
conv.image(dat$test[mis.three[i], -1])
dev.off()
}
Digits that are actually 5’s but are misclassified as 3’s:
mis.five <- intersect(which(y.test.knn==3), which(dat$test[,1]==5))
conv.image(dat$test[mis.five[1], -1])
for (i in 1:length(mis.five)) {
pdf(paste0('mis.five.knn.', i, '.pdf'))
conv.image(dat$test[mis.five[i], -1])
dev.off()
}
The QDA classifier fails with an error: with 256 pixel predictors, many of them nearly constant or collinear, the estimated covariance matrix within a class is singular (rank-deficient).
dat.qda <- qda(V1 ~ ., data=dat$train)
## Error in qda.default(x, grouping, ...): rank deficiency in group 1
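A small self-contained sketch (synthetic data, not the digits) of why this happens and one common workaround, projecting onto a few principal components before fitting QDA:

```r
library(MASS)  # for qda()

set.seed(1)
n <- 30; p <- 60                      # more predictors than observations per class
x <- matrix(rnorm(2 * n * p), ncol = p)
y <- factor(rep(c("a", "b"), each = n))

res <- try(qda(x, y), silent = TRUE)  # singular class covariance -> error
inherits(res, "try-error")            # TRUE

pc <- prcomp(x)$x[, 1:10]             # project onto the first 10 PC scores
fit <- qda(pc, y)                     # full-rank class covariances: fits fine
```

Regularized methods such as the HDRDA classifier below take a different route, shrinking the covariance estimates until they are invertible instead of reducing the dimension first.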
Apply the LDA classifier and test it on the test data:
dat.lda <- lda(V1 ~ ., data=dat$train)
y.test.lda <- predict(dat.lda, dat$test)$class
Confusion matrix for the test data:
table(y.test.lda, dat$test[,1])
##
## y.test.lda 1 3 5
## 1 259 0 1
## 3 4 151 11
## 5 1 15 148
Misclassification rate for the test data:
mean(y.test.lda != dat$test[,1])
## [1] 0.05423729
Digits that are actually 3’s but are misclassified as 5’s:
mis.three <- intersect(which(y.test.lda==5), which(dat$test[,1]==3))
conv.image(dat$test[mis.three[1], -1])
for (i in 1:length(mis.three)) {
pdf(paste0('mis.three.lda.', i, '.pdf'))
conv.image(dat$test[mis.three[i], -1])
dev.off()
}
Digits that are actually 5’s but are misclassified as 3’s:
mis.five <- intersect(which(y.test.lda==3), which(dat$test[,1]==5))
conv.image(dat$test[mis.five[1], -1])
for (i in 1:length(mis.five)) {
pdf(paste0('mis.five.lda.', i, '.pdf'))
conv.image(dat$test[mis.five[i], -1])
dev.off()
}
Apply High-Dimensional Regularized Discriminant Analysis (HDRDA), using the train function in the caret package to tune the hyperparameters. The method has three: lambda (a pooling parameter that shifts each class covariance matrix between the separate class estimate and the pooled covariance), gamma (a shrinkage parameter that shifts the covariance matrix towards or away from a diagonal matrix), and shrinkage_type (the type of covariance-matrix shrinkage, ridge or convex).
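The roles of lambda and gamma can be sketched with the standard RDA-style regularizers (illustrative formulas, not necessarily hdrda's exact internals):

```r
reg.cov <- function(S.class, S.pooled, lambda, gamma) {
  S <- (1 - lambda) * S.class + lambda * S.pooled  # lambda: class <-> pooled covariance
  (1 - gamma) * S + gamma * diag(diag(S))          # gamma: shrink towards a diagonal target
}

S1 <- matrix(c(2, 1, 1, 2), 2)   # a class covariance estimate
Sp <- diag(2)                    # the pooled covariance
reg.cov(S1, Sp, lambda = 0, gamma = 0)  # = S1 (pure QDA behaviour)
reg.cov(S1, Sp, lambda = 1, gamma = 0)  # = Sp (pure LDA behaviour)
reg.cov(S1, Sp, lambda = 0, gamma = 1)  # diagonal of S1: off-diagonals zeroed
```

Both parameters trade variance for bias: pooling and shrinking make the covariance estimates invertible and more stable, which is what lets the method succeed where plain QDA failed above.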
library(caret)
## Loading required package: lattice
set.seed(1234)
train.control <- trainControl(method="repeatedcv", number=5, repeats=5, search="random")
dat.hdrda <- train(V1 ~ ., data=dat$train, method="hdrda", trControl=train.control)
y.test.hdrda <- predict(dat.hdrda, dat$test)
The results of the tuning process:
dat.hdrda
## High-Dimensional Regularized Discriminant Analysis
##
## 2219 samples
## 256 predictor
## 3 classes: '1', '3', '5'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 1775, 1776, 1775, 1775, 1775, 1774, ...
## Resampling results across tuning parameters:
##
## gamma lambda shrinkage_type Accuracy Kappa
## 0.6092747 0.640310605 convex 0.9854015 0.9773447
## 0.6222994 0.860915384 ridge 0.9845002 0.9759442
## 0.6233794 0.009495756 convex 0.9912586 0.9864361
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were gamma = 0.6233794, lambda
## = 0.009495756 and shrinkage_type = convex.
Confusion matrix for the test data:
table(y.test.hdrda, dat$test[,1])
##
## y.test.hdrda 1 3 5
## 1 261 0 0
## 3 2 154 2
## 5 1 12 158
Misclassification rate for the test data:
mean(y.test.hdrda != dat$test[,1])
## [1] 0.02881356